Abstract



Functional Magnetic Resonance Imaging studies of emotion, personality, and social cognition have drawn 
much attention in recent years, with high-profile studies frequently reporting extremely high (e.g., >.8)
correlations between behavioral and self-report measures of personality or emotion and measures of brain
activation. We show that these correlations often exceed what is statistically possible assuming the (evidently
rather limited) reliability of both fMRI and personality/emotion measures. The implausibly high correlations
are all the more puzzling because method sections rarely contain sufficient detail to ascertain how these
correlations were obtained. We surveyed authors of 54 articles that reported findings of this kind to
determine a few details on how these correlations were computed. More than half acknowledged using a 
strategy that computes separate correlations for individual voxels, and reports means of just the subset of 
voxels exceeding chosen thresholds. We show how this non-independent analysis grossly inflates 
correlations, while yielding reassuring-looking scattergrams. This analysis technique was used to obtain the 
vast majority of the implausibly high correlations in our survey sample. In addition, we argue that other 
analysis problems likely created entirely spurious correlations in some cases. We outline how the data from
these studies could be reanalyzed with unbiased methods to provide the field with accurate estimates of the
correlations in question. We urge authors to perform such reanalyses and to correct the scientific record. 



A Puzzle: Remarkably High Correlations in 
fMRI studies on emotion, personality, and 
social cognition 

Functional Magnetic Resonance Imaging 
studies of emotion, personality, and social 
cognition scarcely existed 10 years ago, and 
yet the field has already achieved a 
remarkable level of attention and prominence. 
Within the space of a few years, it has 
spawned several new journals (Social
Neuroscience, Social Cognitive and Affective
Neuroscience), and is the focus of substantial
new funding initiatives (National Institute of 
Mental Health, 2007), lavish attention from 
the popular press (Hurley, 2008) and the trade 
press of the psychological research 
community (e.g., APS Observer, Fiske, 2003). 
Perhaps even more impressive, however, is 
the number of papers from this area that have 
appeared in such prominent journals as 
Science, Nature, and Nature Neuroscience. 

While the questions and methods used in such 
research are quite diverse, a substantial 
number of widely cited papers in this field 
have reported a specific type of empirical 
finding that appears to bridge the divide 
between mind and brain: extremely high
correlations between measures of individual
differences relating to personality, emotion, 
and social cognition, and measures of brain 
activity obtained with functional magnetic 
resonance imaging (fMRI). We focus on these 
studies¹ here because this was the area where
these correlations came to our attention; we 
have no basis for concluding that the problems 
discussed here are necessarily any worse in 
this area than in some other areas. 

To take but a few examples of many studies 
that will be discussed below: 

Eisenberger, Lieberman, and Williams (2003), 
writing in Science, described a game they 
created to expose individuals to social 
rejection in the laboratory. The authors 
measured the brain activity in 13 individuals 
at the same time as the actual rejection took 
place, and later obtained a self-report measure 
of how much distress the subject had 
experienced. Distress was correlated at r=.88
with activity in the anterior cingulate cortex
(ACC).

¹ Studies of the neural substrates of emotion,
personality and social cognition rely on many methods
besides fMRI and PET, including EEG and MEG,
animal research (e.g., cross-species comparisons),
neuroendocrine, and neuroimmunological
investigations (Harmon-Jones & Winkielman, 2007).

In another Science paper, Singer et al. (2004) 
found that the magnitude of differential 
activation within the ACC and left insula 
induced by an empathy-related manipulation 
was correlated between .52 and .72 with two 
scales of emotional empathy (the Empathic 
Concern Scale of Davis, and the Balanced 
Emotional Empathy Scale of Mehrabian). 

Writing in Neuroimage, Sander et al. (2005) 
reported that a subject's proneness to anxiety 
reactions (as measured by an index of the 
Behavioral Inhibition System; Carver and 
White, 1994) correlated at r=.96 with the 
difference in activation of the right cuneus to 
attended versus ignored angry speech. 

In the review below, we will encounter many 
studies reporting similar sorts of correlations. 

The work that led to the present article began 
when the present authors became puzzled 
about how such impressively high correlations 
could arise. We describe our efforts to resolve 
this puzzlement, and the conclusions that our 
inquiries have led us to. 

Why should it be puzzling to find high 
correlations between brain activity and social 
and emotional measures? After all, if new 
techniques are providing a deeper window on 
the link between brain and behavior, does it 
not make sense that researchers should be able 
to find the neural substrates of individual 
traits — and thus potentially bring to light 
stronger relationships than have often been 
found in purely behavioral studies? 

The problem is this: It is a statistical fact (first
noted by researchers in the field of classical
psychometric test theory) that the strength of
the correlation observed between measures A
and B ($r_{\mathrm{Observed}A,\mathrm{Observed}B}$) reflects not only the
strength of the relationship between the traits
underlying A and B ($r_{A,B}$), but also the
reliability of the measures of A and B
($\mathrm{reliability}_A$ and $\mathrm{reliability}_B$, respectively). In
general,

$$r_{\mathrm{Observed}A,\mathrm{Observed}B} = r_{A,B} \cdot \sqrt{\mathrm{reliability}_A \cdot \mathrm{reliability}_B}$$

Thus, the reliabilities of two measures provide 
an upper bound on the possible correlation 
that can be observed between the two 
measures (Nunnally, 1970)².

Reliability Estimates 

So what are the reliabilities of fMRI and
personality/emotional measures likely to be?
The reliability of personality and emotional
scales varies between measures, and according
to the number of items used in a particular
assessment. However, test-retest reliabilities³
as high as .8 seem to be relatively uncommon,
and usually found only with large and highly
refined scales. Viswesvaran and Ones (2000)
surveyed many studies on the reliability of the
Big Five factors of personality, and concluded
that the different scales have reliabilities
ranging from .73 to .78. Hobbs and Fowler
(1974) carefully assessed the reliability of the 
sub-scales of the MMPI, and found numbers 
ranging between .66 and .94, with an average 
of .84. In general, therefore, a range of .7 - .8
would seem to be a somewhat optimistic
estimate for the smaller and more ad hoc
scales used in much of the research described
below, which could well have substantially
lower reliabilities.

² This is the case because the correlation coefficient is
defined as the ratio between the covariance of two
measures and the product of their standard deviations:
$r_{xy} = \sigma_{xy} / (\sigma_x \sigma_y)$. Real-world measurements will be
corrupted by (independent) noise, thus the standard
deviations of the measured distributions will be
increased by the additional noise (whose magnitude is
assessed by the measure’s reliability). This will make
the measured correlation lower than the true underlying
correlation, by a factor equal to the geometric mean of
reliabilities.

³ We consider test-retest reliabilities here (rather than
inter-item, or split-half reliability) because, for the most
part, the studies we discuss gathered behavioral
measures at different points in time than the fMRI data.
In any case, internal reliability measures, like
coefficient alpha, do not generally appear to be much
higher in this domain.

Less is known about the reliability of blood 
oxygenation level dependent (BOLD) signal 
measures in fMRI, but some relevant studies 
have recently been performed⁴. Kong et al.
(2006) had subjects engage in six sessions of a
finger tapping task while recording brain
activation. They found test-retest correlations
of the change in BOLD signal ranging
between 0 and .76 for the set of areas that
showed significant activity in all sessions⁵.
Manoach et al. (2001, their figure 1, p. 956)
scanned subjects on two sessions of
performance on the Sternberg memory
scanning task, and found reliabilities ranging
from .23 to .93, averaging .60. Aron,
Gluck, and Poldrack (2006) had people
perform a classification learning task on two
separate occasions widely separated in time,
and found voxel-level reliabilities with modal
values (see their figure 5, p. 1005) a little bit
below .8⁶. Johnstone et al. (2005, p. 1118)
examined the stability of amygdala BOLD
response to presentations of fearful faces in
multiple sessions. Intraclass correlations for
left and right amygdala regions of interest
were in the range of .4 to .7 for the 2 sessions
separated by 2 weeks. Thus, from the 
literature that does exist, it would seem 
reasonable to suppose that fMRI measures 
computed at the voxel level will not often 
have reliabilities greater than about .7. 



⁴ We focus here on studies that look at the reliability of
BOLD activation measures, rather than the reliability of
patterns of voxels exceeding specific thresholds, which
tend to be substantially lower (e.g., Stark et al., 2004).

⁵ It seems likely that restricting the reliability analysis
to regions consistently active in all sessions would tend
to overestimate the reliability of BOLD signal in
general.

⁶ They found somewhat higher reliabilities for voxels
within a frontostriatal system that they believed was
most specifically involved in carrying out the
probabilistic classification learning.



The Puzzle 

This, then, is the puzzle. Measures of 
personality and emotion evidently do not often 
have reliabilities greater than .8. 
Neuroimaging measures seem typically to be 
reliable at .7 or less. If we assume that a 
neuroimaging study is performed in a case 
where the underlying correlation between 
activation in the brain area and the individual 
difference measure (i.e., the correlation that 
would be observed if there were no 
measurement error) is perfect⁷, then the highest
possible meaningful correlation that could be 
obtained would be sqrt(.8 * .7), or .74. 
Surprisingly, correlations exceeding this upper 
bound are often reported in recent fMRI 
studies on emotion, personality, and social 
cognition. 
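The bound described above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the classical attenuation formula given earlier; the function name `max_observed_correlation` is ours and purely illustrative, and the inputs are the rough reliability figures quoted in the text rather than measured values.

```python
import math


def max_observed_correlation(true_r, reliability_a, reliability_b):
    """Expected observed correlation under the classical attenuation formula."""
    return true_r * math.sqrt(reliability_a * reliability_b)


# Illustrative inputs from the text: a perfect underlying correlation,
# a personality-scale reliability of .8 and an fMRI reliability of .7.
ceiling = max_observed_correlation(1.0, 0.8, 0.7)
print(f"upper bound on the observed correlation: {ceiling:.3f}")
```

With these inputs the bound is sqrt(.56), roughly .748, which is the ceiling of about .74 cited above. Lower reliabilities, or a true correlation below 1, push the ceiling down further.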

Meta-Analysis Methods 

We turned to the original papers to find out 
how common these remarkable correlations 
are, and what analyses might be yielding them. 
Unfortunately, after a brief review of several 
articles, it became apparent that the analyses 
employed varied greatly from one investigator 
to the next, and the exact methods were 
simply not made clear in the typically brief 
and sometimes opaque method sections. 

To probe the issue further, we conducted a 
survey of the investigators. We proceeded as 
follows: First, we attempted to pull together as 
large a sample as we could readily achieve of 
the literature reporting correlations between 
evoked BOLD activity and behavioral
measures of individual differences in
personality, emotionality, social cognition,
and related domains. Then we emailed the
authors of the articles we identified, sending a
brief survey to determine how the reported
correlation values were computed.

⁷ There are several reasons why a true correlation of 1.0
seems highly unrealistic. First, for any behavioral trait,
it is far-fetched to suppose that only one brain area
influences this trait. Second, even if the neural
underpinnings of a trait were confined to one particular
region, it would seem to require an extraordinarily
favorable set of coincidences for the BOLD signal
(basically a blood flow measure) assessed in one
particular stimulus or task contrast to capture all
function relevant to the behavioral trait, which after all
reflects the organization of complex neural circuitry
residing in that brain area.

Literature Review 

Our literature review was conducted using the
keyword “fMRI” (and variants), in
conjunction with a list of social terms (e.g.,
“jealousy”, “altruism”, “personality”, “grief”,
etc.). Within the articles retrieved by these
searches, we selected all the articles we could
find that reported across-subject correlations
between a trait measure and evoked BOLD
activity. This resulted in 54 articles, with 256
significant correlations between BOLD signal 
and a trait measure. It should be emphasized 
that we do not suppose this literature review to 
be exhaustive. Undoubtedly we missed some 
papers reporting these kinds of numbers, but 
our sample seems likely to be quite 
representative, perhaps slanted toward papers 
that appeared in higher impact journals. 

A histogram of these significant correlations is 
displayed in Figure 1. It can be seen that 
correlations in excess of .75 are plentiful 
indeed. 




Figure 1: A histogram of the correlations between evoked BOLD response and behavioral measures
of individual differences seen in the studies identified for analysis in the current article. 
We turn next to the question: where do these
numbers come from? Before doing so, we 
have to provide a bit of background for 
readers unfamiliar with methods in this area. 

Elements of fMRI Analysis

For those not familiar with fMRI analysis, the 
essential steps in just about any neuroimaging 
study can be described rather simply (those 
familiar with the techniques may wish to skip 
this section). The output of an fMRI 
experiment typically consists of two types of 
“3D pictures” (image volumes): “anatomical”
(a high resolution scan that shows anatomical 
structure, not function) and “functional”. 
Functional image volumes are lower 
resolution scans showing measurements 
reflecting, among other things, the amount of 
deoxygenated hemoglobin in the blood - 
blood oxygenation level dependent (BOLD) 
signal. A functional image volume is 
composed of many measurements of the 
BOLD signal in small, roughly cube-shaped, 
regions called “voxels” (‘volumetric pixels’). 
The number of voxels in the whole image 
volume depends on the scanner settings, but it 
typically ranges between 10x64x64 and 
30x128x128 voxels. Thus, each functional 
image contains somewhere between 40,000 
and 500,000 voxels, with each of these voxels 
covering between 1 mm³ (1x1x1 mm) and 125
mm³ (5x5x5 mm) of brain tissue (except for
voxels outside of the brain). A new functional 
image volume is usually acquired every 2 or 3 
seconds (TR, or repetition time) during a scan, 
so one ends up with a timeseries of these 
functional images. 
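The voxel counts and volumes quoted above follow directly from the image dimensions. A minimal sketch of that arithmetic, using only the dimensions given in the text (the helper names are ours):

```python
def voxel_count(slices, rows, cols):
    """Number of voxels in one functional image volume."""
    return slices * rows * cols


def voxel_volume_mm3(edge_mm):
    """Volume of a cubic voxel with the given edge length, in cubic millimetres."""
    return edge_mm ** 3


smallest_image = voxel_count(10, 64, 64)     # lower end of the typical range
largest_image = voxel_count(30, 128, 128)    # upper end of the typical range

print(smallest_image, largest_image)
print(voxel_volume_mm3(1), voxel_volume_mm3(5))
```

These give 40,960 and 491,520 voxels, which the text rounds to 40,000 and 500,000, and voxel volumes of 1 and 125 cubic millimetres.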

These data are typically preprocessed to 
reduce noise and to allow comparisons 
between different brains. The preprocessing 
usually includes smoothing (averaging each 
voxel with its neighbors, weighted by some 
function that falls with distance, such as a 
Gaussian). The studies we focus on here 
ultimately compute correlations across 
subjects: in this kind of study, the voxels are 
usually mapped onto an average brain
(although not always, e.g., Yovel &
Kanwisher, 2005). A number of average-
brain models exist, the most famous being
Talairach (Talairach & Tournoux, 1988) and
MNI (Evans et al., 1993), but some
investigators compute an average brain model 
for their particular subjects, and normalize 
their functional image scans onto that model. 
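The smoothing step mentioned above (averaging each voxel with its neighbors, weighted by a function that falls with distance) can be illustrated in one dimension. This is a minimal sketch on a single row of made-up voxel values; real preprocessing smooths in three dimensions, and the kernel width and edge handling chosen here are our assumptions, not those of any particular package.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius (in voxels)."""
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_1d(signal, sigma=1.0, radius=3):
    """Replace each value with a Gaussian-weighted average of its neighbors."""
    kernel = gaussian_kernel(sigma, radius)
    smoothed = []
    for i in range(len(signal)):
        acc, norm = 0.0, 0.0
        for offset, w in zip(range(-radius, radius + 1), kernel):
            j = i + offset
            if 0 <= j < len(signal):   # ignore neighbors outside the image
                acc += w * signal[j]
                norm += w
        smoothed.append(acc / norm)    # renormalize at the edges
    return smoothed


# A single "active" voxel surrounded by silent ones (made-up values).
row = [0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0]
smoothed_row = smooth_1d(row)
print([round(v, 2) for v in smoothed_row])
```

The isolated peak is lowered and spread over its neighbors, so after smoothing adjacent voxels are no longer independent measurements.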

Following pre-processing, some measure of 
the activation in a given voxel needs to be 
derived to assess if it is related to what the 
person is doing, seeing, or feeling. The 
simplest procedure is just to extract the 
average activation in the voxel while the 
person does a task. However, because any 
task will engage most of the brain (from visual 
cortex to see the stimulus, to motor cortex to 
produce a response, and everything in 
between), fMRI researchers typically focus 
not on the activation in particular voxels 
during one task, but rather on a contrast 
between the activation arising when the 
person performs one task versus the activation 
arising when they do another. This is usually 
measured as follows: while functional images 
are being acquired, the subject does a mixed 
sequence of two different tasks 
(A,B,B,A,A,B,A, and so forth — where A 
might be reading words and B might be 
looking at nonlinguistic patterns). Thus, the 
experimenter ends up with two different time 
series to compare: the sequence of tasks the 
person performed and, separately for each 
voxel, the sequence of activation levels 
measured at that voxel. A regression analysis 
can now be performed to ask: “is this voxel’s 
activity different when the subject was 
performing Task A compared to Task B”? 
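For a single voxel, the regression described above reduces to a very small computation. The sketch below is deliberately simplified: it regresses made-up activation levels on a 0/1 task indicator and ignores the hemodynamic lag and noise modeling that real analyses include. All values and names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def task_contrast(task_labels, voxel_signal):
    """Least-squares slope of the voxel signal on a 0/1 task regressor (A=1, B=0).

    With a single binary regressor this slope equals mean(A) - mean(B).
    """
    x = [1.0 if label == "A" else 0.0 for label in task_labels]
    x_bar, y_bar = mean(x), mean(voxel_signal)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, voxel_signal))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# The task sequence from the text, with made-up activation levels for one voxel.
tasks = ["A", "B", "B", "A", "A", "B", "A"]
signal = [2.1, 1.0, 0.8, 1.9, 2.3, 1.1, 2.0]

contrast = task_contrast(tasks, signal)
print(f"A minus B contrast for this voxel: {contrast:.3f}")
```

Repeating this for every voxel yields one contrast value per voxel per subject, which is the quantity later correlated with behavioral measures.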

These basic steps common to most fMRI data 
analyses yield matrices consisting of tens or 
hundreds of thousands of numbers indicating 
activation levels. These can be (and indeed 
generally are) displayed as images. However, 
to obtain quantitative summaries of these 
results and do further statistics on them (such 
as correlating them with behavioral 
measures — the topic of the present article), an
investigator must somehow select a subset of
voxels and aggregate measurements across 
them. This can be done in various ways. A 
subset of voxels in the whole brain image may 
be selected based on purely anatomical 
constraints (e.g., all voxels in a region 
generally agreed to represent the amygdala, or 
all voxels within a certain radius of some a 
priori specified brain coordinates). 
Alternatively, regions can be selected based 
on “functional constraints”: meaning voxels 
are selected based on their activity pattern in 
functional scans. For example, one could 
select all the voxels for a particular subject 
that responded more to reading than to non- 
linguistic stimuli. Finally, voxels could be 
chosen based on some combination of 
anatomy and functional response. 
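The three selection strategies just described can be made concrete with a toy example. This is a minimal sketch under stated assumptions: the voxel coordinates, contrast values, sphere radius and activation threshold are all invented for illustration.

```python
import math

# Hypothetical voxels: (x, y, z) coordinates in mm and a task-contrast value.
voxels = [
    {"xyz": (0, 0, 0), "contrast": 0.2},
    {"xyz": (2, 0, 0), "contrast": 1.5},
    {"xyz": (0, 3, 0), "contrast": 0.9},
    {"xyz": (10, 0, 0), "contrast": 2.0},
]


def within_radius(voxel, center, radius_mm):
    """Anatomical criterion: voxel lies within a sphere around a priori coordinates."""
    return math.dist(voxel["xyz"], center) <= radius_mm


def is_active(voxel, threshold=1.0):
    """Functional criterion: contrast exceeds an (assumed) threshold."""
    return voxel["contrast"] > threshold


anatomical = [v for v in voxels if within_radius(v, (0, 0, 0), 5.0)]
functional = [v for v in voxels if is_active(v)]
combined = [v for v in anatomical if is_active(v)]


def mean_contrast(selection):
    """Aggregate the selected voxels into one summary number."""
    return sum(v["contrast"] for v in selection) / len(selection)


print(len(anatomical), len(functional), len(combined))
print(mean_contrast(anatomical), mean_contrast(functional), mean_contrast(combined))
```

The summary statistic differs depending on which rule picked the voxels, which is why the selection rule has to be reported for the result to be interpretable.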

In the papers we are focusing on here, the 
final result, as we have seen, was always a 
correlation value — a correlation between each 
person’s score on some behavioral measure, 
and some summary statistic of their brain 
activation. The latter summary statistic 
reflects the activation or activation contrast 
within a certain set of voxels. In either case, 
the critical question is: how was this set of 
voxels selected? As we have seen, voxels 
may be selected based on anatomical criteria, 
functional criteria, or both. Within these 
broad options, there are a number of 
additional more fine-grained choices. It is 
hardly surprising, then, that brief method 
sections rarely suffice to describe how the 
analyses were done in adequate detail to really 
understand what choices were being made. 

Survey methods 

To learn more than the Method sections of 
these papers disclosed about the analyses that 
yielded these correlations, we emailed the 
corresponding authors of these articles. The 
exact wording of our questions is included in 
Appendix 1, but we often needed to send
customized follow-up questions to figure out
the exact details when the survey questions 
were misunderstood, or did not match our 
reading of the methods section. 

In our survey we first inquired whether the 
fMRI signal measure that was correlated 
across subjects with a behavioral measure 
represented the average of some number of 
voxels, or instead, the activity from just one 
voxel that was deemed most informative 
(referred to as the peak voxel). 

If it was the average of some number of 
voxels, we inquired about how those voxels 
were selected, asking whether they were
selected based only on anatomy, only on the
activation seen in those voxels, or both.

If activation was used to select voxels, or one 
voxel was determined to be most informative 
based on its activation, we asked what was the 
measure of activation used. Was it the 
difference in activation between two task 
conditions computed on individual subjects, or 
was it a measure of how this task contrast 
correlated with the individual difference 
measure? 

Finally, if functional data were used to select 
the voxels, were they the same functional data 
as were used to define the reported correlation? 

Survey participants 

Of the 55 articles we found in our review, we 
received methodological details from 53, and 
2 did not respond to repeated requests. 

Survey Results 

We display the raw results from our survey as 
the proportion of studies that investigators 
described with a particular answer to each 
question (Figure 2). Since some questions 
only applied to a subset of participants, we 
display only the proportion of the relevant 
subset of studies. 
Figure 2. The results of our survey of individual-difference correlation methods between fMRI
signals and measures of emotion, personality, and social cognition. Of the 55 articles surveyed, the
authors of 53 provided responses. Of those, 23 reported a correlation between behavior and one
peak voxel; 30 reported the mean of a number of voxels. For those that reported the mean of a
subset of voxels, 7 defined this subset purely anatomically, 12 used only functional constraints, and
11 used anatomical and functional constraints. Of the 46 studies that used functional constraints to
choose voxels (either for averaging, or for finding the ‘peak’ voxel), 10 said they used functional
measures defined within a given subject, 29 used the across-subject correlation to find voxels, and 7
did something else. All of the studies using functional constraints used the same data to select voxels,
and then to measure the correlation. Notably, 54% of the surveyed studies selected voxels based on a
correlation with the behavioral individual-differences measure, and then used those same data to
compute a correlation within that subset of voxels.



The raw answers to our survey do not by 
themselves explain how the (implausibly high, 
or so we have argued) correlations were 
arrived at. The key, we believe, lies in the 
54% of respondents who said that “regression 
across subjects” was the functional constraint 
used to select voxels: indicating that voxels
were selected because they correlated highly
with the behavioral measure of interest.⁸

⁸ It is important to note that all of these studies also
reported using the same data to compute the correlation
as were initially used to select the subset of voxels.

Figure 3 shows very concretely the sequence
of steps that these respondents reported
following in analyzing their data. A separate
correlation across subjects was performed for
each voxel within a specified brain region.
Each correlation relates some measure of 
brain activity in that voxel (which might be a 
difference between responses in two tasks or 
in two conditions) with the behavioral 
measure for that individual. Thus, the number
of correlations computed was equal to the
number of voxels (meaning that in many cases,
thousands of correlations were computed). At
the next stage, the set of voxels for which this
correlation exceeded some threshold was
selected, and some measure of the
relationships for the voxels that exceeded this
threshold was reported.
Figure 3: An illustration of the analysis employed by 54% of the papers surveyed. (a) From each
subject, the researchers obtain a behavioral measure as well as BOLD measures from many voxels.
(b) The activity in each voxel is correlated with the behavioral measure of interest across subjects.
(c) From this set of correlations, researchers select those voxels that pass a statistical threshold, and
(d) aggregate the fMRI signal across those voxels to derive a final measure of the correlation of
BOLD signal and the behavioral measure.



What are the implications of selecting voxels 
in this fashion? Such an analysis will inflate 
observed across-subject correlations, and can 
even produce significant measures out of pure 
noise. The problem is illustrated in the simple 
simulation displayed in Figure 4: (a) the
investigator computes a separate correlation of
the behavioral measure of interest with each of
the voxels. Then, (b) those voxels that
exhibited a sufficiently high correlation
(passing a statistical threshold) are selected.
Then an ostensible measure of the ‘true’
correlation is aggregated from the voxels that
showed high correlations (e.g., by taking the
mean of the voxels over the threshold). With
enough voxels, such a biased analysis is
guaranteed to produce high correlations even
if none are truly present (Figure 4). Moreover,
this analysis will produce visually pleasing
scattergrams (e.g., Figure 4c) that will provide
(quite meaningless) reassurance to the viewer 
that s/he is looking at a result that is solid, 
“not driven by outliers”, etc. 
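The selection procedure described above is easy to reproduce on pure noise. The following is a minimal sketch of one simulated experiment, not the code used for Figure 4: it assumes 10 subjects, 10,000 voxels, independent zero-mean Gaussian noise, and a selection threshold of r > .715, which is approximately the one-tailed p < .01 cutoff for 10 subjects. The function names and the random seed are ours.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_non_independent(n_subjects=10, n_voxels=10000,
                             r_threshold=0.715, seed=0):
    """Select noise voxels by their correlation with behavior, then
    correlate the mean of the selected voxels with that same behavior."""
    rng = random.Random(seed)
    behavior = [rng.gauss(0, 1) for _ in range(n_subjects)]
    selected = []
    for _ in range(n_voxels):
        voxel = [rng.gauss(0, 1) for _ in range(n_subjects)]  # pure noise
        if pearson(voxel, behavior) > r_threshold:
            selected.append(voxel)
    if not selected:
        return 0, float("nan")
    # Per-subject mean "activity" across the selected voxels.
    mean_activity = [sum(v[s] for v in selected) / len(selected)
                     for s in range(n_subjects)]
    return len(selected), pearson(mean_activity, behavior)


n_selected, biased_r = simulate_non_independent()
print(f"{n_selected} voxels selected; correlation of their mean: {biased_r:.2f}")
```

Although every voxel is noise, roughly 1% of them pass the threshold by chance, and the correlation computed from their mean is very high. Plotting `mean_activity` against `behavior` would give the kind of clean scattergram described above.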
Figure 4: A simulation of a non-independent analysis on pure noise data. We simulated 1000
experiments each with 10 subjects and 10000 voxels, and one individual difference measure. Each
subject’s voxel activity and behavioral measure were independent 0-mean Gaussian noise. Thus, (a)
the true distribution of correlations between the behavioral measure and simulated voxel activity is
distributed around 0, with random fluctuations resulting in a distribution that spans the range of
possible correlations. (b) When a subset of voxels is selected for passing a statistical threshold (a
positive correlation with p<0.01), the observed correlation of the mean ‘activity’ of those voxels is
very high indeed. (c) If the BOLD activity from that subset of voxels is plotted as a function of the
behavioral measure, a compelling scattergram may be produced. (For similar exercises in other
neuroimaging domains see Appendix 2; Baker, Hutchison, et al., 2007; Simmons et al., 2006;
Kriegeskorte et al., 2008)



The non-independence error 

The fault seen in glaring form in Figure 4 will 
be referred to henceforth as the non- 
independence error. This approach amounts 
to selecting one or more voxels based on a 
functional analysis, and then reporting the 
results of the same analysis and functional 
data from just the selected voxels. This 
analysis distorts the results by selecting noise 
exhibiting the effect being searched for, and 
any measures obtained from such a non- 
independent analysis are biased and 
untrustworthy (for a formal discussion see Vul 
& Kanwisher, in press). 

It may be easier to appreciate the gravity of 
the non-independence error by transposing it 
outside of neuroimaging. We (the authors of 
this paper) have identified a weather station
whose temperature readings predict daily
changes in the value of a specific set of stocks 
with a correlation of r=-0.87. For $50.00, we 
will provide the list of stocks to any interested 
reader. That way, you can buy the stocks 
every morning when the weather station posts 
a drop in temperature, and sell when the 
temperature goes up. Obviously, your 
potential profits here are enormous. But you 
may wonder: how did we find this correlation? 
The figure of -.87 was arrived at by separately 
computing the correlation between the 
readings of the weather station in Adak Island, 
Alaska, with each of the 3315 financial 
instruments available for the New York Stock 
Exchange (through the Mathematica function 
FinancialData) over the 10 days that the 
market was open between November 18th and
December 3rd, 2008. We then averaged the
correlation values of the stocks whose
correlation exceeded a high threshold of our 
choosing, thus yielding the figure of -.87. 
Should you pay us for this investment 
strategy? Probably not: Of the 3,315 stocks 
assessed, some were sure to be correlated with 
the Adak Island temperature measurements 
simply by chance - and if we select just those 
(as our selection process would do), there was 
no doubt we would find a high average 
correlation. Thus, the final measure (the 
average correlation of a subset of stocks) was 
not independent of the selection criteria (how 
stocks were chosen): this, in essence, is the 
non-independence error. The fact that random 
noise in previous stock fluctuations aligned 
with the temperature readings is no reason to 
suspect that future fluctuations can be 
predicted by the same measure, and one would 
be wise to keep one’s money far away from us, 
or any other such investment advisor⁹.
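The point that a selected correlation says nothing about future data can be checked by simulation. This is a minimal sketch using made-up Gaussian noise in place of real temperature and stock data; the selection threshold of r < -.7 is our assumption, since the text does not state the threshold that was used.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_then_retest(n_series=3315, n_days=10, r_threshold=-0.7, seed=1):
    """Pick noise series that correlate with a noise 'temperature' over one
    period, then measure the same series against it over a fresh period."""
    rng = random.Random(seed)
    temps_old = [rng.gauss(0, 1) for _ in range(n_days)]
    temps_new = [rng.gauss(0, 1) for _ in range(n_days)]
    old_rs, new_rs = [], []
    for _ in range(n_series):
        stock_old = [rng.gauss(0, 1) for _ in range(n_days)]
        stock_new = [rng.gauss(0, 1) for _ in range(n_days)]
        r_old = pearson(temps_old, stock_old)
        if r_old < r_threshold:                 # selection on the old data
            old_rs.append(r_old)
            new_rs.append(pearson(temps_new, stock_new))
    if not old_rs:
        return 0, float("nan"), float("nan")
    return len(old_rs), sum(old_rs) / len(old_rs), sum(new_rs) / len(new_rs)


n_picked, selected_mean, fresh_mean = select_then_retest()
print(f"{n_picked} series picked; mean r on selection data {selected_mean:.2f}, "
      f"on fresh data {fresh_mean:.2f}")
```

The mean correlation on the data used for selection is strongly negative by construction, while the mean correlation of the same series on fresh data hovers near zero.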

Variants of the non-independence error occur 
in many different types of neuroimaging 
studies and in many different domains. The 
non-independence error is by no means 
confined to fMRI studies on emotion, 
personality, and social cognition, nor to 
studies correlating individual behavioral 
differences with evoked fMRI activity. (For 
broader discussions of how non-independent 
analyses produce misleading results in other 
domains, see: Vul & Kanwisher, in press;
Kriegeskorte et al., 2008; Baker, Hutchison,
et al., 2007; Baker, Simmons, et al., 2007;
Simmons et al., 2006).

Our survey allows us to determine which of
the studies were committing variants of the
non-independence error by finding analyses in
which researchers selected voxels (answered
A or B to question 1) based on correlation
with the across-subject behavioral measure of
interest (answered B or C to question 2, and B
to question 3), then plotted or reported the
observed correlations from just those voxels
(answered A to question 4).

⁹ See Taleb (2004) for a sustained and engaging
argument that this error, in subtler and more disguised
form, is actually a common one within the world of
market trading and investment advising.

Results and Discussion 

For maximum clarity, we will present the 
results of our survey, and our overall analysis 
of how these results should be interpreted, in 
the form of a number of questions and 
answers. 

A. Are the correlation values reported in this 
literature meaningful? 

Of the 52 articles we successfully surveyed, 
28 provided responses indicating that a non- 
independent analysis, like the one portrayed in 
Figures 3 and 4, was used to obtain the across- 
subject correlations between evoked BOLD 
activity and a measure of individual 
differences. As we saw in Figure 4, a non- 
independent analysis systematically distorts 
any true correlations that might exist. Thus, in 
half of the studies we surveyed, the reported 
correlation coefficients mean almost nothing, 
because they are systematically inflated by the 
biased analysis. The magnitude of this 
distortion depends upon variables (such as the 
number of voxels within the brain, noise and 
signal variance, etc.) which a reader would 
have no way of knowing, so it is not possible 
to correct for it. The problem is exacerbated 
in the case of the 38% of our respondents who 
reported the correlation of the peak voxel (the 
voxel with the highest observed correlation) 
rather than the average of all voxels in a 
cluster passing some threshold. 

Figure 5 shows the histogram of correlation
values with which our investigation started,
this time color-coded by whether or not such a 
non-independent analysis was employed in the 



Thanks to Lieberman, Berkman and Wager (in press) 
for pointing out that clerical errors in an earlier version 
of this histogram that circulated on the internet had 
resulted in omissions (now corrected). In the course of 
reviewing our files, we also realized that study 55, 
surveyed in April, 2008, was inadvertently omitted 
from the earlier histogram.

article. It is reassuring to see that the mode of
independently acquired (i.e., valid) correlation 
values (coded green) is indeed below the 
‘theoretical upper bound’ we anticipated from 
classical test theory and the limited 
information we have on test reliability 
(described in the introduction). The 
overwhelming trend is for the larger 
correlations to be emerging from non- 
independent analyses that are statistically 
guaranteed to inflate the measured correlation 
values. 
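
The ‘theoretical upper bound’ mentioned here follows from the classical test theory attenuation formula: an observed correlation cannot exceed the true correlation multiplied by the square root of the product of the two measures' reliabilities. A minimal sketch of that calculation follows. The reliability values (0.8 and 0.7) are illustrative assumptions rather than figures stated in this section, and the function name is hypothetical.

```python
import math

def max_observable_correlation(reliability_x, reliability_y, true_r=1.0):
    """Attenuation formula: r_observed = r_true * sqrt(rel_x * rel_y)."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# Assumed reliabilities for a questionnaire measure and an fMRI measure.
bound = max_observable_correlation(0.8, 0.7)
print(round(bound, 2))  # 0.75
```

Even with a perfect underlying relationship (true_r = 1.0), measures of this assumed quality would cap the expected observed correlation at roughly .75.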

In looking at Figure 5, it is tempting to assume 
that the non-independent (red) correlations, 
had they been measured properly, would have 
values around the central tendency of the 
independent (green) correlations (around 
.6). Thus, one might say, “it is very
unfortunate that the numbers were seriously
exaggerated, but the real relationships here are 
still pretty impressive.” In our view, any such 
inference is unwarranted; many of the real 
relationships are probably far lower than the 
ones shown in green. After all, the published 
studies reporting independent measures of 
correlations are still predominantly those that 
found significant effects (resulting in the well 
known publication bias for significant results; 
cf. Ioannidis, 2005), and correlations much
lower than .5 would often not have been 
significant with these sample sizes. We would 
speculate that, properly measured, many of the 
"red correlations" would have been far lower 
still, and may not exist at all. (For a 
discussion of the relationship between the 
non-independence error and the use of spatial 
clustering thresholds, see Appendix 2.)

[Figure 5: histogram. Horizontal axis: absolute correlation value (0.2 to 0.9). Vertical axis: frequency (number of significant correlations reported). Legend: Independent, Non-independent, No Response.]

Figure 5. The histogram of the correlation values from the studies we surveyed (same data as
Figure 1), this time color-coded by whether or not the article from which each correlation originated
used non-independent analyses. Correlations coded in green correspond to those that were obtained
with independent analyses, avoiding the bias described in this paper. Those in red
correspond to the 54% of articles surveyed that reported conducting non-independent analyses;
these correlation values are certain to be inflated. Entries in orange arise from papers whose
authors chose not to respond to our survey. (See Table 1 below for key to article number. Color
coding corresponds to whether or not the one correlation which we surveyed a particular article
about was non-independent. There are varying gradations of non-independence. For instance, study
26 carried out a slightly different non-independent analysis: instead of explicitly selecting for a
correlation between IAT and activation, the authors split the data into two groups, those with high IAT
scores and those with low IAT scores, found voxels that showed a main effect between
these two groups, and then computed a correlation within those voxels. In Study 23, voxels
were selected on a measure correlated with the final measure of interest. These procedures are also
not independent, and will also inflate correlations, perhaps to a lesser degree.)



B. Is the problem being discussed here 
anything different than the well-known 
problem of multiple comparisons raising the 
probability of false alarms? 

Every fMRI study involves vast numbers of
voxels, and comparisons of one task to 
another involve computing a t-statistic and 
comparing it to some threshold. When 
numerous comparisons are made, adjustments 
of threshold are needed, and are commonly 
employed. The conventional approach 
involves finding voxels that exceed some 
arbitrarily high threshold of significance on a 
particular contrast (e.g., reading a word versus 
looking at random shapes). This multiple 
comparisons correction problem is well 
known and has received much attention. 

The problem we describe arises when authors 
then report secondary statistics on the data in 
the voxels that were selected originally. In the 
case discussed in the present article, 
correlations are both the selection criterion 
and the secondary statistic. 

When people compare reading a word versus 
reading a letter, and find brain areas with a t 
value of 13.2 (with 11 degrees of freedom, 
comparable to an r of .97, or an effect size of 
d=2.4), few people would interpret the t value
as a measure of effect size. On the other hand,
in the case of the r values under discussion 
here, we would contend that essentially 
everyone interprets them in that way. 
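
The equivalence between the t value and the r value quoted above can be checked with the standard identity r = t / sqrt(t^2 + df). The short sketch below applies it to the numbers in the text; the function name is hypothetical.

```python
import math

def t_to_r(t, df):
    """Convert a t statistic with df degrees of freedom to the equivalent r."""
    return t / math.sqrt(t * t + df)

r = t_to_r(13.2, 11)
print(round(r, 2))  # 0.97
```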

C. What may be inferred from the 
scattergrams often exhibited in connection 
with non-independent analyses? 

Many of the papers reporting biased 
correlation values display scattergram plots of 
evoked activity as a function of the behavioral 
measure. These plots are presumably included 
in order to show the reader that the correlation 
is not being driven by a few outliers, or by 
other aberrations in the data. However, when 
non-independent selection criteria are used to 
pick out a subset of voxels, the voxels passing 
this criterion will inevitably contain a large 
admixture of noise favoring the correlation 
(see the scattergram in Figure 4c for an 
example of a case where the relationship is 
pure noise). Thus, the shape of the resulting 
scattergrams provides no reliable indication 
about the nature of the possible correlation 
signal underlying the noise, if any. 

D. How can these same methods sometimes 
produce no correlations?

It may be as surprising to some readers, as it
was to us, that a few papers reporting 
extraordinarily high correlations arrived at 
through non-independent analyses also 
reported some negative results (correlations 
that failed to reach significance). If the same 
analysis methods were applied to each 
correlation investigated, shouldn’t the same 
correlation-amplifying bias apply to each one? 

Indeed it should normally do so. However, 
with a bit of investigation, we were able to 
track down the source of (at least some of) the 
inconsistency: in certain papers, the bias 
inherent in non-independent analyses was 
sometimes wielded selectively, in such a way 
as to inflate certain correlations, but not others. 

Take for instance Takahashi et al (2006), 
reporting an interaction in the presence of a 
correlation between evoked BOLD activity 
and rated jealousy in men and women: activity 
in the insula correlated with self-reported 
jealousy about emotional infidelity in men 
(r=0.88), but not women (r=-0.03). The
opposite was true of activity in the posterior
STS, which correlated with such self-reported
jealousy in women (r=0.88), but not men
(r=-0.07). At first blush, the scattergrams and
correlations exhibit a very striking interaction 
(reported as significant at p<0.001). However, 
the insula activity corresponds to the peak 
voxel of a cluster that passed statistical 
threshold for the correlation between rated 
jealousy and BOLD signal in males; thus the 
observed correlation with rated jealousy in 
males was non-independent and biased, while 
the same correlation for rated jealousy in 
females was independent. The pSTS activity 
was selected for correlating with rated 
jealousy in females, and thus only the jealousy 
correlation in males was independent in that 
region. 

It should come as no surprise, therefore, that 
such non-independently selected data 
produced a striking interaction in which the 
non-independent analyses showed high 
correlations while the independent analyses
showed no correlation. Thus, the presence of
the interaction, along with the magnitude of 
the correlations themselves, is quite 
meaningless and could have been obtained 
with completely random data like those 
utilized in the simulation shown in Figure 4. 
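
The logic of this argument can be illustrated with a small simulation on pure noise. This is a sketch under stated assumptions, not a reconstruction of any surveyed study: the group size, the number of voxels, and all names are hypothetical, and every value is random by construction. A peak voxel is chosen by its correlation with behavior in one group, and the correlation at that same voxel is then computed in a second group that played no part in the selection.

```python
import random

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)            # fixed seed so the sketch is repeatable
N_SUBJECTS, N_VOXELS = 10, 5000   # assumed sizes, chosen only for illustration

def noise_group():
    """Behavioral scores and voxel activity that are unrelated by construction."""
    behavior = [rng.gauss(0, 1) for _ in range(N_SUBJECTS)]
    voxels = [[rng.gauss(0, 1) for _ in range(N_SUBJECTS)]
              for _ in range(N_VOXELS)]
    return behavior, voxels

behavior_a, voxels_a = noise_group()
behavior_b, voxels_b = noise_group()

# Non-independent step: pick the voxel that correlates best in group A.
peak = max(range(N_VOXELS), key=lambda v: pearson_r(voxels_a[v], behavior_a))

r_selected_group = pearson_r(voxels_a[peak], behavior_a)  # biased by selection
r_other_group = pearson_r(voxels_b[peak], behavior_b)     # independent of selection

print(f"group used for selection: r = {r_selected_group:.2f}")
print(f"group not used for selection: r = {r_other_group:.2f}")
```

Because the peak voxel is the maximum of thousands of chance correlations, the first value is large even though no relationship exists, while the second is an ordinary draw from the null distribution. An apparent group-by-region interaction can therefore arise from the selection step alone.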

E. But is there really any viable alternative to 
doing these non-independent analyses? 

It is all very well to point out ways in which 
research methods fall short of the ideal. 
However, the ideal experiment and the ideal 
analysis are often out of reach, especially in 
fields like psychology and cognitive 
neuroscience. Perhaps we must settle for 
somewhat imperfect designs and methods to 
get any information whatsoever about across- 
subject brain-behavior correlations. Are any 
better methods available? 

We contend that the answer is a clear-cut 
“Yes”. These kinds of brain-behavior 
linkages can be readily investigated with 
designs that do not invite any of the rather 
disastrous complications that accompany the 
use of non-independent analyses. 

One method is to select the voxels comprising 
different regions of interest in a principled 
way that is “blind” to the correlations of those 
voxels with the behavioral measure and also 
mindful of the fact that individuals’ brains are 
far from identical. For instance, to assess the 
relationship between ACC activity during 
exclusion and reactions to social rejection 
measured in a questionnaire, one would first 
put the social rejection data aside, and not 
“peek” at it while analyzing the fMRI data. 
The researcher can then define regions of 
interest in individual subjects in whatever way 
seems appropriate; e.g., by identifying voxels 
within the anatomical confines of the ACC 
that were significantly active for the excluded- 
included contrast (or, even better, using a 
different contrast, or different data, altogether). 
Once a subset of voxels is defined within an 
individual subject, one number should be 
aggregated from these voxels (e.g., the mean
signal change). Only then are the behavioral
data examined, and an unbiased correlation 
can be computed between the ACC region of 
interest and the behavioral measure. This 
method was used by a few of the authors of 
the current studies, e.g., Kross et al (2007). 
In addition to providing an unbiased measure 
of any relationships between evoked activity 
and individual differences, this ‘functional 
Region of Interest’ (fROI) method avoids 
implausible assumptions about voxel-wise 
correspondence across different individuals’ 
functional anatomy (Saxe, Brett, & 
Kanwisher, 2006). 

If one feels that it makes sense to draw 
voxelwise correspondences between the 
functional anatomy of one subject and another, 
a second alternative exists: a ‘split half’ 
analysis. Here, half of the data are used to 
select a subset of voxels exhibiting the 
correlation of interest, and the other half of the 
data are used to measure the effect (examining 
the same voxels, but looking at different runs 
of the scanner). For example, if there are 4 
runs in the social exclusion and 4 runs in the 
neutral condition, one can use 2 exclusion 
runs and 2 neutral runs to identify voxels that 
maximize the correlation, and then test the 
correlation of the behavioral trait with these 
same voxels, but looking only at the other 2
runs. Such a procedure uses independent data
for voxel selection and the subsequent
correlation test, and thus avoids the non-
independence error. This straightforward

11 Although it is possible for voxels registered to the
‘average brain’ to be functionally matched across
subjects, the variability in anatomical location of well-
studied regions even in early visual cortex (V1, MT)
and visual cognition (FFA) suggests to us that higher-
level functions determining individual differences in
personality and emotionality are not likely to be
anatomically uniform across individuals (Saxe, Brett, &
Kanwisher, 2006).

12 At first blush, one might worry that using only half
of the data to select the correlated regions will greatly 
decrease statistical power. However, there are two 
reasons why this should not be a concern. First, 
removing half of the data from each subject does not 
reduce the number of data-points that go into the 



analysis may be computed on all of the 
suspect results noted in our paper thus far, and 
can be used to provide unbiased estimates of 
the correlations reported in these papers. 
Techniques of this kind (hold-out validation
and cross-validation) are used in a variety of 
fields (including fMRI) to evaluate the 
generality of conclusions when over-fitting is 
a possibility (Geisser, 1993) - as is the case 
when picking a small subset of many 
measured correlations as a measure of the true 
correlation. 
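
A minimal sketch of the split-half idea, again on pure noise, is given here. All sizes and names are assumptions made for illustration: each simulated subject contributes two independent halves of data per voxel, voxels are selected on the first half only, and the correlation for the resulting region is then measured both on the selection half (biased) and on the held-out half (unbiased).

```python
import random

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)                       # fixed seed for repeatability
N_SUBJECTS, N_VOXELS, TOP_K = 16, 2000, 20   # assumed, illustrative sizes

behavior = [rng.gauss(0, 1) for _ in range(N_SUBJECTS)]
# Two independent halves of the runs for every voxel; no true signal anywhere.
half_1 = [[rng.gauss(0, 1) for _ in range(N_SUBJECTS)] for _ in range(N_VOXELS)]
half_2 = [[rng.gauss(0, 1) for _ in range(N_SUBJECTS)] for _ in range(N_VOXELS)]

# Selection uses half 1 only: keep the TOP_K voxels most correlated with behavior.
ranked = sorted(range(N_VOXELS),
                key=lambda v: pearson_r(half_1[v], behavior), reverse=True)
selected = ranked[:TOP_K]

def roi_mean(data, voxels):
    """Average the selected voxels within each subject (one number per subject)."""
    return [sum(data[v][s] for v in voxels) / len(voxels)
            for s in range(N_SUBJECTS)]

r_same_half = pearson_r(roi_mean(half_1, selected), behavior)  # non-independent
r_held_out = pearson_r(roi_mean(half_2, selected), behavior)   # independent

print(f"measured on the selection half: r = {r_same_half:.2f}")
print(f"measured on the held-out half:  r = {r_held_out:.2f}")
```

With no real effect present, the same-half estimate is strongly inflated, whereas the held-out estimate behaves like a single ordinary correlation computed on noise. With real data, the held-out value would estimate whatever true relationship exists.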

It may often be advisable to use both of the 
methods just described, because they may find 
slightly different kinds of (real) patterns in the 
data. The first type of analysis focuses on the 
voxels that are most active in the task contrast 
at issue. This is a sensible place to look first 
to find relationships with individual 
differences. However, it is possible that the 
behavioral individual differences may be most 
closely associated with activity in some subset 
of voxels which may not show the greatest 
activity in this contrast. For example, it is 
possible that within the ACC there could be 
neural structures whose magnitude of response 
is related to rejection, even if the mean 
activation in those structures across subjects 
does not differ from zero. 



across-subject correlation - it simply makes the 
estimate of BOLD activity for an individual subject 
more noisy (by a factor of sqrt(2)). This is not as 
detrimental to the ability to evaluate a correlation as 
reducing the number of data points. Second, stringent 
corrections for multiple comparisons are unnecessary 
for an independent split-half analysis; thus, a
(reasonably) liberal threshold may be chosen to select
the subset of voxels that correlate with the behavioral 
measure in the first half of the data. The statistical 
inference relies on the magnitude of the correlation 
observed in those voxels in the second half of the data - 
a single comparison, which will have ample power to 
detect any effect that may be close to significant in a 
properly corrected whole-brain analysis. For an even 
more data-efficient (but computationally intensive) 
independent validation technique, variants of the ‘k-
fold’ method can also be used (Breiman & Spector,
1992).







F. Even if correlations were overestimated 
due to non-independent analyses, can’t we at
least be sure the correlations are statistically 
significant (and thus that there exists a real, 
nonzero, correlation)? 

In most of the non-independent analyses, the
voxels included in the computation of the
reported correlation were those that passed a
threshold for significance that was based on
some combination of the correlation value for
each voxel and the spatial contiguity between
the voxel and other elevated voxels (a
threshold that typically included some
ostensible adjustment for multiple
comparisons). Given that, can we not be sure
that there is a real, albeit weaker-than-reported, 
correlation? In principle, this ought to be the 
case - but only if the correction for multiple 
comparisons is appropriately implemented. 

We did not explicitly survey the authors about 
their multiple comparisons correction 
procedures, but we do see evidence that the 
corrections used in this literature may often be 
less than trustworthy. The most common 
method of correcting for multiple comparisons 
used in this literature is family-wise error 
correction relying on “minimum cluster size 
thresholds”. In this approach, the correlation
in clusters of voxels is determined to be 
significant if the cluster contains a sufficiently 
large number of contiguous voxels each 
exceeding some statistical threshold. This 
procedure “relies on the assumption that areas 
of true neural activity will tend to stimulate 
signal changes over contiguous pixels” 
(Forman et al., 1995), i.e., “signal” will tend
to show up as activity that extends beyond a 
single voxel, whereas statistical noise will 
generally be independent from one voxel to its 
neighboring voxel and thus will not usually 
appear in large clusters.

See Appendix 2 for a discussion of whether the
problem of inflated correlations is eliminated by the use 
of a cluster-based threshold. 

Technically, the rationale is somewhat more 
complicated and relies on estimates of the spatial 



Given particular scan parameters, one can
use various sophisticated techniques to 
compute the probability of falsely detecting a 
cluster of voxels (Type I error). This 
probability may be estimated using the 
AlphaSim tool from the program AFNI 
(Analysis for Functional Neuroimaging) 
(Cox, 1996; Douglas Ward). We noticed that 
many papers in our sample chose p-thresholds 
of 0.005 and cluster size thresholds of 10, and 
stated that these choices were made relying 
upon Forman et al. (1995) as an authority. For 
instance, Eisenberger, et al. (2003) claimed 
that their analysis had a per-voxel false 
positive probability of “less than 0.000001.” 
They used these thresholds on 19x64x64 
imaging volumes at 3.125x3.125x4 mm, 
smoothed with 8 mm full-width at half-max 
Gaussian kernel. We were puzzled that these 
parameters would be able to reduce the rate of 
false alarms to the degree claimed, and so we 
investigated using AlphaSim. According to 
the AlphaSim simulations, pure noise data is 
likely to yield a cluster passing this threshold 
in nearly 100% of all runs (a per-voxel false 
alarm probability of 0.002)! To hold the false 
detection probability for a particular cluster 
below .000003 (thus keeping the overall 



correlations known to be present in the voxels (e.g., due 
to smoothing). The smoothness assumption defines how 
likely it is for pure noise observations with these spatial 
statistics to contain clusters with a particular number of 
contiguous voxels exceeding statistical threshold. 

These parameters include: voxel dimensions, volume 
dimensions, smoothing parameter (sometimes data 
smoothness as estimated from the data), minimum 
cluster size, and minimum single-voxel p-threshold. 

The method used by AlphaSim allows users to enter
an estimate of the smoothness of the data (the
literal smoothing kernel is often an underestimate, and a
better estimate is the output of the FWHMx
function, which computes a measure of ‘smoothness’
by measuring the spatial correlation in the data in
addition to the smoothing parameter applied - this is
default in SPM). Thus, simply entering the smoothing
kernel into AlphaSim underestimates the smoothness of
the data, and underestimates the probability of a falsely
detected cluster. For our purposes, this means that the
numbers obtained from AlphaSim will actually
underestimate how large the clusters must be to reach a
certain false alarm probability.

probability of a false positive in the analysis
below the commonly desired alpha level of 
0.05), a far larger cluster size (namely, 56
voxels) would need to be used. Thus, we
suspect that the .000001 figure cited by
Eisenberger et al. (2003) and other authors
actually reflects a misinterpretation of
Forman’s simulation results. It seems that
ostensible corrections for multiple 
comparisons with the cluster size method are 
at least sometimes misapplied, and thus, even 
the statistical significance of some correlations 
in this literature may be questionable. 
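
A back-of-the-envelope calculation gives some intuition for why chance clusters are plausible with the parameters quoted above. The sketch below is not a substitute for the AlphaSim simulation: it counts every voxel in the imaging volume (not only those inside the brain), ignores smoothing, and only computes how many voxels are expected to pass the per-voxel threshold in pure noise. The variable names are hypothetical.

```python
n_voxels = 19 * 64 * 64   # imaging volume described in the text
p_threshold = 0.005       # per-voxel threshold described in the text

# Expected number of noise voxels exceeding the threshold by chance alone.
expected_false_voxels = n_voxels * p_threshold
print(n_voxels, round(expected_false_voxels))  # 77824 389
```

With several hundred supra-threshold noise voxels scattered through a spatially smoothed volume, it is unsurprising that some of them fall into contiguous groups of 10 or more.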

In general, it is important to keep in mind 
what statistics the conclusions of a particular 
paper rely on. In many papers, a liberal 
threshold is used to select an ROI (one that 
would be insufficiently conservative to 
address the multiple comparisons problem), 
and then an independent secondary statistic is 
computed on the ROI voxels. The 
conclusions of such papers usually rest on the 
secondary statistic computed within the ROI; 
what threshold was used to select the ROI 
voxels does not matter so much. In the cases 
we discuss in this paper, the secondary 
statistics are non-independent, and are thus 



Even if the brain occupied just one tenth of the 
imaging volume (7,700 voxels), the parameters 
described would falsely detect a cluster 60% of the time 
in pure noise - in this case, the appropriate minimum 
cluster size threshold would need to be 27, rather than 
10, to reach a false detection rate of 0.05. 

The per-voxel false detection probabilities described
by Eisenberger et al. (and others) seem to come from
Forman et al.’s Table 2C. Values in Forman et al.’s
table report the probability of false alarms that cluster
within a single 2D slice (a single 128x128 voxel slice,
smoothed with a FWHM of 0.6*voxel size). However,
the statistics of clusters in 2D (a slice) are very different
from those of a 3D volume: there are many more
opportunities for spatially clustering false alarm voxels in
the 3D case, as compared to the 2D case. Moreover, the
smoothing parameter used in the papers in question was
much larger than the 0.6*voxel size assumed by Forman in
Table 2C (in Eisenberger et al., this was >2*voxel size).
The smoothing, too, increases the chances of false
alarms appearing in larger spatial clusters.



biased and meaningless. In these cases, the 
criteria used to select voxels become the only
statistic which may legitimately be used to 
evaluate the results; thus, the selection criteria 
are of utmost importance for the conclusions 
of the paper. 

It should be emphasized that we certainly do 
not contend that problems with corrections for 
multiple comparisons exist in all (or even a 
majority) of the papers surveyed. Many 
comparisons are corrected in a defensible 
fashion. Moreover, even papers using 
multiple comparisons corrections that, strictly 
speaking, rely on assumptions that were not 
really met, may report relationships that do 
indeed exist at least to some nonzero extent. 
In any case, we argue that (a) the actual 
correlation values reported by the non- 
independent analyses comprising over half of 
the studies we examined are sure to be inflated 
to the point of being completely untrustworthy, 
(b) assertions of statistical significance based 
on non-independent analyses require careful 
scrutiny — which does not always appear to 
have been done in the publication process. 
Perhaps most importantly, we argue (c) that if 
researchers would use the approaches 
recommended above (see Question E) they
could avoid the whole treacherous terrain of 
non-independent analyses and its attendant 
uncertainties and complexities. In this way, 
the statistics would only need to be done once, 
the false alarm risk would be completely 
transparent, and there would be no need to use 
highly complex corrections for multiple 
comparisons that rest on hard-to-assess 
assumptions. 

G. Well, in those cases where the correlation 
really is significant (i.e., nonzero), isn’t that 
what matters, anyway? Does the actual 
correlation value really matter so much? 

We contend that the magnitude, rather than 
the mere existence, of the correlation is what 
‘really matters’. A correlation of 0.96 (as in 
Sander et al., 2005) indicates that 92% of the
variance in proneness to anxiety is predicted
by the right cuneus response to angry 
speech. A relationship of such strength would 
be a milestone in understanding of brain- 
behavior linkages, full of promise for potential 
diagnostic and therapeutic spin-offs. In 
contrast, suppose — and here we speak purely 
hypothetically— the true correlation in this case 
were 0.1, accounting for 1% of the variance. 
The practical implications would be far less, 
and the scientific interest would be greatly 
reduced as well. A correlation of 0.1 could be
mediated by a wide variety of highly indirect 
relationships devoid of any generality or 
interest. For instance, proneness to anxiety 
may lead people to breathe faster, drink more 
coffee, or make slightly different choices in 
which lipids they ingest. All of these are 
known to have effects on BOLD responses 
(Weckesser et al, 1999; Mulderink et al., 2002;
Noseworthy et al, 2003), and those effects 
could easily interact slightly with the specific 
hemodynamic responses of different brain 
areas. Or perhaps anxious people are more 
afraid than others of failing to follow task 
instructions and attend ever so slightly more to 
the required auditory stream. The weaker the 
correlation, the greater the number of indirect 
and uninteresting causal chains that might be 
accounting for it, and the greater the chance 
that the effect itself will appear and disappear 
in different samples in a completely 
inscrutable fashion (e.g., if the dietary 
propensities of anxious people in England 
differ from those of anxious people in Japan). 
We suspect that it is for this reason that the 
field of risk-factor epidemiology is said to 
have reached some consensus that findings 
involving modest but statistically significant 
risk ratios (e.g., ratios between 1.0 and 2.0) 
have not generally proven to be robust or 
important. It seems likely to us that most 
reviewers in behavioral and brain sciences 
also implicitly view correlation magnitude as 
important, and we suspect that the very fact 
that so many of the studies reviewed here 
appeared in high-impact journals partly
reflects the high correlation values they
reported. 

Concluding Remarks 

We began this article by arguing that many 
correlations reported in recent fMRI studies 
on emotion, personality, and social cognition 
are “impossibly high”. Correlations of this 
magnitude are unlikely to occur even if one 
makes the (implausible) assumption that the 
true underlying correlations — the correlations 
that would be observed if there were no 
measurement error — are perfect. We then 
went on to describe our efforts to figure out 
how these impossible results could possibly be 
arising. While the method sections of articles 
in this area did not provide much information 
about how analyses were being done, a survey 
of researchers provided a clear and worrisome 
picture. Over half of the investigators in this 
area used methods that are guaranteed to offer 
greatly inflated estimates of correlations. As 
seen in Figure 5, these procedures turn out to 
be associated with the great majority of the 
correlations in the literature that struck us as 
impossibly high.

Interestingly, we suspect that the problems 
brought to light here are ones that most editors 
and reviewers of studies using purely 
behavioral measures would usually be quite 
sensitive to. Suppose an author reported that a 
questionnaire measure was correlated with 
some target behavioral measure at r=.85, and 
that this number was arrived at by separately 
computing the correlation between the target 
measure and each of the items on the 
questionnaire, and reporting just the average 
of the highest-correlated questionnaire items. 
Moreover, to assess whether these highest- 
correlated questionnaire items were just the 
tail of a chance distribution across the many 
items, a filtering procedure had been used 
with properties too complex to derive 
analytically. We believe that few prestigious 



The others (high green numbers in Figure 5) could 
simply reflect normal sampling variability of the sort 
found with any kind of imperfect measurement.

psychology journals would publish such
findings. It may be that the problems are not
being recognized in this field because of the
relative unfamiliarity of the measures, and the
relatively greater complexity of the data
analyses. Moreover, perhaps the fact that the
papers report using procedures that include
some precautions relating to the issue of
multiple comparisons leads reviewers to
assume that such matters are all well taken
care of.

As discussed above, one thing our conclusions
leave open is whether, behind any given
inflated correlation, there is at least some real
relationship (i.e., a true correlation higher
than zero). Most investigators used thresholds
that ostensibly correct for multiple
comparisons but, we have argued, in some
cases these corrections were seriously
misapplied. Based on the analysis described
above, we suspect that while in many cases
the reported relationships probably reflect
some underlying relationship (albeit a much
weaker relationship than the numbers in the
articles implied), it is quite possible that a
considerable number of relationships reported
in this literature are entirely illusory.

To sum up, then, we are led to conclude that a
disturbingly large, and quite prominent,
segment of fMRI research on emotion,
personality, and social cognition is using
seriously defective research methods and
producing a profusion of numbers that should
not be believed. Although we have focused
here on studies relating to emotion,
personality, and social cognition, we suspect
that the questionable analysis methods
discussed here are also widespread in other
fields that use fMRI to study individual
differences, such as cognitive neuroscience,
clinical neuroscience, and neurogenetics.

A Suggestion to Investigators 

Despite the dismal scenario painted in the last
paragraph, we can end on a much more
positive note. We pointed out earlier how
investigators could have explored these
behavioral trait-brain activity correlations
using methods that do not have any of the
logical and statistical deficiencies described
here. The good news is that in almost all
cases the correct (and simpler) analyses can
still be performed. It is routine, and often
required by journals and funders, for large
neuroimaging data sets (which have usually
been collected at great cost to public agencies)
to be archived. Therefore, in most cases it is
not too late to perform the analyses advocated
here (or possibly others that also avoid the
problem of non-independence). Thus, we
urge investigators whose results have been
questioned here to perform such analyses and
to correct the record by publishing follow-up
errata that provide valid numbers. At present,
all studies performed using these methods
have large question marks over them.
Investigators can erase these question marks
by re-analyzing their data with appropriate
methods.

APPENDIX 1: fMRI Survey Question Text

Would you please be so kind as to answer a few very quick questions about the analysis that
produced, i.e., the correlations on page XX. We expect this will just take you a minute or two at 
most. 

To make this as quick as possible, we have framed these as multiple choice questions and listed the 
more common analysis procedures as options, but if you did something different, we'd be obliged if 
you would describe what you actually did. 

The data plotted reflect the percent signal change or difference in parameter estimates (according to 
some contrast) of... 

1. ...the average of a number of voxels. 

2. ...one peak voxel that was most significant according to some functional measure. 

3. ...something else? 

If 1: 

The voxels whose data were plotted (i.e., the "region of interest") were selected based on... 

1a. ...only anatomical constraints (no functional data were used to define the region, e.g., all voxels 
representing the hippocampus). 

1b. ...only functional constraints (voxels were selected if they passed some threshold according to a 
functional measure - no anatomical constraints were used; e.g., all voxels significant at p<.0001, 
or all voxels within a 5 mm radius of the peak voxel) 

1c. ...anatomical and functional constraints (voxels were selected if they were within a particular 
region of the brain and passed some threshold according to a functional measure; e.g., all voxels 
significant at p<.0001 in the anterior cingulate) 

1d. ...something else? 

If you picked [1b, 1c, or 2] above could you please advise us about the following: 

The functional measure used to select the voxel(s) plotted in the figure was... 

[A] ...a contrast within individual subjects (e.g., condition A greater than condition B at some p 
value for a given subject) 

[B] ...the result of running a regression, across subjects, of the behavioral measure of interest 
against brain activation (for a contrast) at each voxel. 

[C] ...something else? 

Finally: the fMRI data (runs/blocks/trials) displayed in the figure were... 

[A] ...the same data as those employed in the analysis used to select voxels (the functional 
localizer). 

[B] ...different data from those employed in the analysis used to select voxels (the functional 
localizer). 

Thank you very much for giving us this information so that we can describe your 
study accurately in our review. 
APPENDIX 2 



G. Most papers use cluster size, not just a high threshold, to capture correlations. Does the 
inflation of correlation problem still exist in this case? 

Yes. The problem arises from imposing any threshold which does not capture the full distribution of 
the ‘true effect’. Since any true signal will also be corrupted by measurement noise, measurements 
of voxels that really do correlate with the behavioral measure of interest will also produce a 
distribution (although in this case the distribution will have a mean with a value that differs from 
zero). Imposing a threshold on this distribution will select only some samples - those with more 
favorable patterns of noise. If nearly the whole distribution is selected (statistical power is nearly 1) 
and there are no false alarm clusters, there would be no inflation. However, the lower the power, the 
more biased the selected subsample. Although cluster-size correction methods effectively increase 
power, they do not increase it sufficiently to mitigate bias. For simple whole-brain contrasts, cluster-size 
methods appear to provide power that does not exceed 0.4 (and will more likely be substantially 
lower than that; Friston, Holmes, Poline, Price, and Frith, 1995). If statistical power is at 0.4, that 
means that only the top 40% of the true distribution will be selected - the mean of these selected 
samples will be very much higher than the true mean. 
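The size of this selection effect can be made concrete with a small calculation. The sketch below is an illustration only: it assumes that the measured values of a truly correlated voxel follow a normal distribution (the text above does not specify a shape) and that selection keeps exactly the top fraction given by the power. The function name mean_of_top_fraction and its parameters are hypothetical.

```python
from statistics import NormalDist


def mean_of_top_fraction(power, true_mean=0.0, sd=1.0):
    """Mean of the top `power` fraction of a normal(true_mean, sd) distribution.

    Assumption: measurements are normally distributed and thresholding
    keeps exactly the upper `power` fraction of them.
    """
    std = NormalDist()                 # standard normal
    z = std.inv_cdf(1.0 - power)       # cutoff, in units of sd
    # Mean of a normal truncated from below at z: pdf(z) / P(X > z)
    return true_mean + sd * std.pdf(z) / power


if __name__ == "__main__":
    for power in (0.4, 0.8, 0.999):
        shift = mean_of_top_fraction(power)
        print(f"power={power}: selected mean exceeds true mean by {shift:.3f} sd")
```

Under these assumptions, a power of 0.4 puts the mean of the selected samples nearly one standard deviation of the measurement noise above the true mean, and the excess shrinks toward zero only as power approaches 1.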




Figure A3: Simulation of cluster size correction and measure variable inflation. 

For the moderately technical audience we provide a simplified cluster-size threshold simulation to 
show the magnitude with which the underlying signal can be inflated by an analysis procedure of 
roughly the sort we describe in this article. We generated a random 1000x1000 voxel slice (300x300 
subset shown; the dimensions are irrelevant in our case, because we had a constant proportion of 
signal voxels) by generating random noise for each voxel (gaussian noise with mean 0 and standard 
deviation of 3.5). We blurred this slice with gaussian smoothing (kernel standard deviation = 2), 
thus inducing a spatial correlation between voxels, and resulting in an effective standard deviation of 
0.5 per voxel. We then added “signals” to this noise: Signals were square “pulses” added to 
randomly chosen 5x5 sub-regions of the matrix. Within one simulated matrix, 25% of the voxels 
were increased by 1. The color map shows measured intensity of a given voxel, with 0 being the 
noise average, 1 (marked with a *) the signal average. 
We then did a simple cluster-search (finding 5x5 regions in which every voxel exceeded a particular 
threshold). We tried a number of different height thresholds, and for each threshold we measured 
the probability of a false alarm (the probability that a voxel that was within a 5x5 region in which all 
voxels passed threshold did not contain a true signal) — the logarithm (base 10) of this probability is 
the x axis (-2 corresponds to p(FA) = 0.01, -0.3: p(FA) = 0.5). We also computed the inflation of the 
measured signal compared to the true signal in the detected voxels, as a percentage of true mean 
voxel amplitude; this is plotted on the y axis. “**” on the x-axis corresponds to simulated thresholds 
that did not produce any false alarm voxels in our simulations, thus, those reflect only regions that 
were entirely composed of signals. Error bars correspond to +/- 1.96 standard deviations across 
simulations for each threshold. (Naturally, low thresholds are on the right of the graph, producing 
many false alarms, high thresholds are on the left, producing few, if any, false alarms). A crude 
summary of the results of this simulation is that taking only signals that pass a threshold always 
inflates the underlying signal rather seriously (given thresholds that have a reasonable probability of 
false alarm), and as thresholds are raised to decrease false alarms, the signal inflation becomes even 
greater. (Matlab code available upon request) 
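
For readers who wish to experiment, the procedure described above can be approximated in a few lines. The following is a scaled-down sketch in Python, not the Matlab code mentioned above. It makes several simplifying assumptions of its own: a 100x100 slice, wrap-around edges for the smoothing, a kernel truncated at 3 standard deviations, signal pulses placed on a non-overlapping 5x5 tiling, and inflation expressed relative to the mean true signal of the detected voxels. All function and parameter names are hypothetical.

```python
import math
import random


def gaussian_kernel(sd, radius):
    """Normalized 1-D Gaussian weights, truncated at +/- radius."""
    w = [math.exp(-(k * k) / (2 * sd * sd)) for k in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]


def blur(field, kernel):
    """Separable Gaussian blur with wrap-around edges (an assumption)."""
    n, r = len(field), len(kernel) // 2
    rows = [[sum(kernel[k + r] * row[(j + k) % n] for k in range(-r, r + 1))
             for j in range(n)] for row in field]
    return [[sum(kernel[k + r] * rows[(i + k) % n][j] for k in range(-r, r + 1))
             for j in range(n)] for i in range(n)]


def simulate(thresholds, n=100, block=5, seed=0):
    """Return per-threshold detection statistics (None if nothing is detected)."""
    rng = random.Random(seed)
    # Noise: sd 3.5 per voxel, smoothed with kernel sd 2 (roughly sd 0.5 afterwards).
    noise = blur([[rng.gauss(0.0, 3.5) for _ in range(n)] for _ in range(n)],
                 gaussian_kernel(2.0, 6))
    # True signal: 25% of tile-aligned 5x5 blocks raised by 1 (tiling is a simplification).
    tiles = [(a, b) for a in range(0, n, block) for b in range(0, n, block)]
    truth = [[0.0] * n for _ in range(n)]
    for a, b in rng.sample(tiles, len(tiles) // 4):
        for i in range(a, a + block):
            for j in range(b, b + block):
                truth[i][j] = 1.0
    measured = [[noise[i][j] + truth[i][j] for j in range(n)] for i in range(n)]

    results = {}
    for t in thresholds:
        above = [[v > t for v in row] for row in measured]
        detected = set()
        # Cluster search: keep voxels in any 5x5 window that is entirely above threshold.
        for i in range(n - block + 1):
            for j in range(n - block + 1):
                cells = [(i + di, j + dj) for di in range(block) for dj in range(block)]
                if all(above[a][b] for a, b in cells):
                    detected.update(cells)
        if not detected:
            results[t] = None
            continue
        m = len(detected)
        measured_mean = sum(measured[a][b] for a, b in detected) / m
        true_mean = sum(truth[a][b] for a, b in detected) / m
        p_fa = sum(1 for a, b in detected if truth[a][b] == 0.0) / m
        inflation = (100.0 * (measured_mean - true_mean) / true_mean
                     if true_mean > 0 else float("inf"))
        results[t] = {"n_detected": m, "p_false_alarm": p_fa,
                      "measured_mean": measured_mean, "true_mean": true_mean,
                      "inflation_pct": inflation}
    return results


if __name__ == "__main__":
    for t, r in simulate([0.0, 0.25, 0.5, 0.75, 1.0]).items():
        print(t, r)
```

A single small slice gives noisy estimates, so the numbers from one run should not be compared directly with the figure. Repeating the run over many seeds and averaging per threshold would be needed to approximate the curves and error bars described above.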